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Massively parallel DNA sequencing technologies are revolu- 
tionizing genomics research. Billions of short reads gener- 
ated at low costs can be assembled for reconstructing the 
whole genomes. Unfortunately, the large memory footprint 
of the existing de novo assembly algorithms makes it chal- 
lenging to get the assembly done for higher eukaryotes like 
mammals. In this work, we investigate the memory issue of 
constructing de Bruijn graph, a core task in leading assembly 
algorithms, which often consumes several hundreds of giga- 
bytes memory for large genomes. We propose a disk-based 
partition method, called Minimum Substring Partitioning 
(MSP), to complete the task using less than 10 gigabytes 
memory, without runtime slowdown. MSP breaks the short 
reads into multiple small disjoint partitions so that each 
partition can be loaded into memory, processed individually 
and later merged with others to form a de Bruijn graph. 
By leveraging the overlaps among the k-mers (substring of 
length k), MSP achieves astonishing compression ratio: The 
total size of partitions is reduced from O(fcn) to O(n), where 
n is the size of the short read database, and k is the length 
of a fc-mer. Experimental results show that our method can 
build de Bruijn graphs using a commodity computer for any 
large- volume sequence dataset. 

Source codes and datasets: graf ia. cs .ucsb . edu/msp 

1. INTRODUCTION 

High-quality genome sequencing is foundational to many 
critical biological and medical problems. Recently, mas- 
sively parallel DNA sequencing technologies [13] , such as II- 
lumina [2] and SOLiD pTJ, have been reducing the cost signif- 
icantly. The price for Human Whole Genome Sequencing at 
a 30X coverage has dropped to $3, 750 (www.knome.com). 
The massive amount of short reads (short sequences with 
symbols A, C, G, T) generated by these next-generation tech- 
niques [13] quickly dominate the scene. How to manage and 
process the Big Sequence Data becomes a database issue. 

A key problem in genome sequencing is assembling mas- 



sive short reads that are extracted from DNA segments. The 
number of short reads can easily reach one billion; and the 
length of each read varies from a few tens of bases to sev- 
eral hundreds. Figure [1] shows a sequence assembly process, 
where three short sequences are assembled to a longer se- 
quence based on their overlaps. 




ACTGATTATTACCGTA ATACTGAGCTCGGACA 
CGTATTCGTATCTATA 

ACTGATTATTACCGTATTCGTATCT ATACTGAGCTCGGACA 

Figure 1: Sequence Assembly 

The above process, called De novo assembly, has been ex- 
tensively studied in the past decade. There are two kinds of 
approaches: the overlap-layout-consensus approach [171119] . 
and the de Bruijn graph approach [18] [26] \M [5] [12] . The 
overlap-layout-consensus approach builds an overlap graph 
between short reads. Due to the sheer size of the overlap 
graph (each read can overlap with many other reads), this 
approach is more suitable for small genomes. The de Bruijn 
graph approach breaks short reads to k-mers (substring of 
length k) and then connects k-mers according to their over- 
lap relations in short reads. It can assemble larger quantities 
(e.g., billions) of short reads with greater coverage. 

Despite their popularity, large memory consumption is 
a bottleneck for both approaches [15]. For the short read 
sequences generated from mammalian-sized genome, algo- 
rithms such as Euler [18], Velvet [26], AUPaths [5] and SOAP- 
denovo [12] have to consume hundreds of gigabytes memory. 
Figure [2] shows a breakdown of memory and runtime con- 
sumption in SOAPdenovo [12] on a 258.7 GB Cladonema 
short read dataset and a 137.5 GB Lake Malawi cichlid 
(fish) short read dataselQ. Obviously, the most memory con- 
suming and time intensive part is the de Bruijn graph con- 
struction step. Similar results were also reported for other 
datasets [12]. In this work, we resort to a novel disk-based 



A de Bruijn graph based assembly process consists of six 
steps: error correction (optional), de Bruijn graph construc- 
tion, contig generation, reads remapping, scaffolding (op- 
tional) and gap closure (optional). The last two steps are 
applicable when pair end information is available. 
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approach to tackle this bottleneck, using less than 10 giga- 
bytes memory, without runtime slowdown. 
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Figure 2: SOAPdenovo: Statistics of Computational 
Complexity at Each Assembly Step 

In a de Bruijn graph, each vertex represents a k-mer. In 
order to build the graph, we have to identify the same k-mers 
scattered in different short reads. A straightforward solution 
is to build a hash table. We can encode each symbol, A, C, 
G, and T using 2 bits. In the aforementioned 137.5 GB fish 
datasets (the read length is 101), when k = 59, there are 
about 11.8 billion distinct k-mers including reverse comple- 
ments. Assuming a load factor of 2/3 for the hash table, we 
could expect the hash table to take nearly 283 GB memory, 
which is too large. 

Alternatively, one can apply a disk-based partition-merge 
approach, which is popular in databases. Given a set of 
short reads S, there are two classic scatter-gather methods 
to identify duplicate k-mers: (1) partition S horizontally 
into disjoint subsets, Si, S2, ■ ■ ■ , St, for each subset Si, gen- 
erate a hash table Hi of their k-mers in main memory, output 
a sorted copy Hi to disk, and then merge Hi, H%, . . . ,Ht; (2) 
partition all k-mers from S into disjoint subsets Si, S2, ■ ■ ■ , St 
based on their last few symbols, for each subset Si, create 
a hash table Hi, build a k-mer mapping and output Hi to 
disk, and then combine them. Both methods do not require 
a large amount of memory; but they are slow. The first so- 
lution, requiring multiple disk scans and sorts, is hopeless. 
The second one has to generate a huge number of k-mers in 
the first step. For the 258.7 GB Cladonema dataset, with 
k = 59, the disk file of k-mers is close to 3TB and the time 
used to finish duplicate mapping is around 30 hours. 

In this paper, we re-examine the second scatter-gather ap- 
proach and find a drawback existing in its k-mer partitioning 
strategy. Many k-mers generated from the same short read, 
though having large overlaps inside, are distributed to differ- 
ent partitions, which caused huge overhead. Inspired by this 
discovery, we introduce a new concept, called minimum sub- 
string partitioning (MSP). MSP breaks short reads to pieces 
larger than k-mers; each piece contains k-mers sharing a 
common minimum substring with fixed length p, p < k. The 
effect is equivalent to compressing consecutive k-mers using 
the original sequences. We demonstrate that this compres- 
sion approach does not introduce significant computational 
overhead, but could lead to 10-15 times smaller partitions, 
thus improving performance dramatically. It is observed 
that the size of MSP partitions is only slightly larger than 
the original sequences. Based on a random string model, 
we analytically derive the expected size of minimum sub- 
string based partitions, which is reduced from O(fcn) to 
O(n), where n is the size of the short read database, and 
k is the length of a fc-mer. Furthermore, we prove that the 
size of the largest partitions decreases exponentially with re- 
spect to p, indicating that it is very memory-efficient. When 



p = 12, the memory consumption is less than 10G for all the 
real datasets we tested. 

Our main contribution is the development of an innova- 
tive disk-based partitioning strategy for solving a critical 
graph construction problem in genome sequence assembly. 
Our solution is disk-based, using a small amount of mem- 
ory without runtime performance loss. To the best of our 
knowledge, our study is the first work that introduces mini- 
mum substring partitioning, studies its properties, and suc- 
cessfully applies it to de novo sequence assembly, a critical 
problem in genome analysis. Experimental results show that 
our method can build de Bruijn graphs using a commodity 
computer for any large- volume sequence dataset. 

2. PRELIMINARIES 

Definition 1 (Short Read, K-Mer). A short read is 
a string over alphabet E. A k-mer is a string whose length 
is k. Given a short read s, s[i,j] denotes the substring of s 
between the i t h and j t h (both inclusive) elements, s can be 
broken into m — k+1 k-mers, written as s[l, k], s[2, k+1], . . ., 
s[m — k + l,m]. K-mers s[i, k + i — 1], s[i + 1, k + i] are called 
adjacent in s. 

For a short read s, we can view k-mers generated in a way 
that a window with width k slides through s. Two k-mers, 
a and /3, are adjacent from a to P if and only if the last 
k — 1 substring of a is the first k — 1 substring of f3. Let S 
be a short read set S = {si}. A k-mer extracted from Si, 
s i\j>3 + k — 1]> i s written as Sij. 

Definition 2 (De Bruijn Graph). Given a short read 
set S — {si}, a de Bruijn graph G = {V, E} is constructed by 
creating a vertex for every distinct k-mer in S and connect- 
ing two vertices with a directed edge if their correspona 
k-mers are adjacent in at least one short read. 



(a) 

(b) 



Sequence: ACCAACGTTG 
AGCAACTCGT 




Figure 3: A de Bruijn Graph Example: k=3 

Figure[3]shows a de Bruijn graph generated from two short 
reads with k being 3. The edge weight shows the number 
of times the two adjacent k-mers appear in short reads. For 
sake of simplicity, we do not depict the k-mers generated 
by the reverse complements of short reads (see details in 
Section [4}. 

2.1 K-mer Mapping 

Given a short read dataset, in order to build a de Bruijn 
graph, one has to map all the duplicate k-mers derived from 
different short reads into the same vertex. If vertices are as- 
signed with integer id's, e.g., starting at 1, this is equivalent 
to mapping duplicate k-mers to the same id. This process 
is called K-mer Mapping. Once the mapping is built, by 
scanning the short reads, we can create the edge set for the 
de Bruijn graph naturally. Therefore, the task of building 
a de Bruijn graph is narrowed down to k-mer mapping and 
edge sequence generation. 
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2.2 Scatter/Gather 

One solution to the memory bottleneck issue is to chunk 
data to several partitions and process them separately |24l 
125] . In this section, we discuss two scatter/gather approaches 
derived from duplicate detection techniques and then show 
their space complexity. The first solution is called Horizon- 
tal Partition (H-Partition). 

1. Divide short read dataset S to disjoint partitions with 
equal size, Si, &, . . ., St, such that each partition can 
be loaded into memory. 

2. For each partition Si, insert fc-mers into a hash table 
Hi. Based on the insertion order, assign an increasing 
integer id, starting at I, to each distinct fc-mer. Let 
Mi be the k-mer mapping function in Si. Mi is local. 

3. For each partition Si, output all k-mers Sij (in this 
case, we need to output the k-mer itself and its index, 
(i, j)) together with the assigned id, in increasing order 
of (i,j). Let Pi be the output sequences. 

4. Merge {Pi} to generate a global mapping function M 
such that it satisfies the following constraint. For any 
k-mer 7 extracted from partition Sj, let Si be the 
partition with the smallest i that contains 7, then 
M( 7 ) = M« (7). 

The output size of Step 3 is O(fcn), where n is the size of 
the short read database, and k is the k-mer's length. Step 
4 in H- Partition is costly. It needs a sort/merge process to 
identify the duplicate k-mers in different partitions. 

The main issue of H-Partition arises from the fact that 
the multiple occurrences of the same k-mer are not located 
in the same partition. To overcome this issue, one common 
strategy is to do bucket partitioning. Let H be a hash func- 
tion of k-mer. We can generate t partitions by distributing 
k-mer Sij to the H(sij) mod t partition. We can also use 
k-mers' last several symbols to scatter them into different 
partitions. This classic approach is called Bucket Partition 
(B-Partition). 

1. Extract all k-mers from S and put them to disjoint 
partitions, Si, 5*2, . . . , St, according to H(si.j) mod t. 

2. For each partition Si, insert fc-mers into a hash ta- 
ble Hi and assign an increasing integer id, starting at 
S'lJSjl, to each distinct fc-mer based on the insertion 
order, where \Sj\ is the number of distinct fc-mers in 
partition Sj. Let M be this k-mer mapping function. 
It is clear that M is a global mapping function: each 
distinct k-mer in S will have one unique id. 

3. For each partition Si, output all k-mers Sij (in this 
case, we only need to output the index , not the k- 
mer string) together with the assigned id, in increasing 
order of Let Pi be the output sequences. 

4. Merge {Pi} in increasing order of 

While Step 4 in B-Partition is much faster than that in 
H-Partition, the total size of all the partitions is the same 
B(fcn), which could easily reach multiple terabytes for a 
large genome. In the following discussion, we introduce a 
new partitioning concept, minimum substring partitioning 
(MSP), that reduces the partition size to O(n). 



3. MINIMUM SUBSTRING PARTITIONING 

Bucket partitioning has high overhead since adjacent k- 
mers are likely distributed to different partitions, unless 
H(sij) mod t = H(si,j+i) mod t. Karp and Rabin [10] 
proposed a rolling hash function with the property that the 
hash value of consecutive k-mers can be calculated quickly. 
However, it is unknown whether there exists such a hash 
function that with high probability, two adjacent k-mers 
could be mapped to the same partition. In this study, we 
resort to another approach to bypass this problem. 

Definitions (Minimum Substring [20]). Given a string 
s, a length-p substring r of s is called the minimum p-substmng 
(or pivot substring) of s, i/Vs', s' is a length-p substring of 
s, s.t., r < s' (< defined by lexicographical order), s is said 
to be covered by r. The minimum p-substring of s is written 
as min p (s). 



ACTGATTATTAACCGTACAAATTT 



ACTG ATTATT AACC GTA 
CTG ATTATT AACC GTAC 
TGATTATT AACC GTACA 
GATTATT AACC GTACAA 
ATTATT AACC STACAAA 



Figure 4: Minimum Substring Partitioning 

Since two adjacent k-mers overlap with length k — 1 sub- 
string, the chance for them to have the same minimum p- 
substring (p < k) could be very high. Figure [4] illustrates 
that the first 5 k-mers have the same minimum 4-substring, 
AACC. In this case, instead of generating these 5 k-mers 
separately, one can just compress them using the original 
short read, to ACTGATT ATT AACC GTAC AAA, and out- 
put it to the partition corresponding to the minimum 4- 
substring AACC. Formally speaking, given a short read 
s = S1S2 • • ■ s m , if the adjacent j k-mers from s[i, i + k — 1] 
to s [i +j — l,i+j+k — 2] share the same minimum p-substring 
r, then one can just output substring SiSi+i . . . s i+ j + k-2 to 
partition H (r) mod t without breaking it to j k-mers. If j 
is large, this compression strategy will dramatically reduce 
the partition size and runtime. 

Definition 4 (Minimum Substring Partitioning). 
Given a string s = S1S2 • • • s m , P < k < m, minimum 
substring partitioning breaks s to substrings with maximum 
length {s[i,j]\i + k — 1 < j, 1 < i, j < m}, s.t., all k-mers 
in s[i,j] share the same minimum p-substring. s[i,j] is also 
called super k-mer. 

According to minimum substring partitioning, larger p 
will likely break a sequence to several segments with dif- 
ferent minimum p-substrings, thus increasing the total par- 
tition size. On the other hand, a smaller p will produce 
larger partitions that might not fit in the main memory. 

Figure[5]shows the distribution of partition size withp = 4 
on the bee, fish, cladonema and bird datasets (ref. to Ta- 
ble 1 for details). The partitions are sorted according to 
their sizes. There are several large dominating partitions. 
The value of p determines the total size of partitions and 
the expected size of the largest partitions. In the follow- 
ing discussion, using a random string model, we prove that 
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Figure 5: Partition Size Distribution 

the expected total partition size is 0(n), far smaller than 
<d(kn) in H-Partition and B-Partition. We will further show 
the lower and upper bound of the largest partition in MSP, 
which decreases exponentially with respect to p, indicating 
that MSP is very memory-efficient. 

3.1 Total Partition Size 

Let I be the average number of breaks that MSP intro- 
duces in a given sequence dataset. That is, on average, 
MSP adds I breaks to a sequence and divides it into multi- 
ple substrings s[ii,ji], s[ia,ja], s[ij+i, ji+i]. Let m be 
the length of individual short reads. Suppose there are n/m 
short reads, i.e., n is the dataset size. We have the following 
theorem. 

Theorem 3.1. The total partition size is 0(^n + n). 

Proof. Each break introduces a substring that overlaps 
its previous substring with k — 1 symbols. We have 
breaks. Hence, the total partition size is B(^n + n). □ 

Assume a random string model with four symbols 0, 1, 2, 
and 3, each having equal probability to occur. We first use 
a simulation method to demonstrate the average number of 
breaks for 1M short reads with length m = 100. 



k=59, p=10 
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Figure 6: Average Number of Breaks 
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Figure 6(a)| shows the expected number of breaks with 
respect to different p and k values. When p increases, the 
number of breaks increases. When k increases, the number 



of breaks decreases. Figure 6(b) shows the expected breaks 



of short reads with respect to different m values, with p = 10 
and k = 59. It is observed that the average number of breaks 
increases proportionally with respect to m. We prove this 
in the following theorem. 



Theorem 3.2. Let l(m,k,p) be the average number of 
breaks under minimum substring partitioning. In a random 
string model, l(m, k,p) oc (m — k). 

Proof. It is trivial to have / = when m — k, because 
the whole string has no break in this situation. Consider 
the difference between l(m, k,p) and l(m — 1, k,p). In an m 
length string, let Pi(fc,p)=Pr{the minimum p-substring of 
the last k-mer is different from the second last one}. This 
equals to Pi(fc,p)=Pr{the first or the last p-substring is the 
only smallest p-substring}. Since Pi is only related to the 
last k + 1 characters, it is not related to m. Then we have, 

l{m, k,p) = l(m - 1, k,p) + Pi(k,p) 
= --- = P 1 (k,p)-(m-k). 

□ 

Theorem 13.21 told us that I increases proportionally with 
respect to m — k, with a ratio of Pi. Now we examine the 
bound of P\(k,p). 




Figure 7: Illustration of Theorem 3.3 



Theorem 3.3. In a random string model, Pi(k,p) < f-py. 

Proof. Given any string s = snsi ... s^, we concatenate 
Sk and so to form a ring as depicted in Figure[7] The ring can 
generate k + 1 length- (k + 1) strings by starting at different 
positions: R j =s J s ]+1 . . . s u+k)mod(k+1) j = 0, 1, . . . , k. Let 
Sk, P = { length-(fc + 1) string whose first or last p-substring 
is the only minimum p-substring in it }. We have P\(k,p) = 
|Sfe i p|/4 fe+1 . Now we calculate at most how many Ri strings 
belong to Sk, P - Let r be one of the minimum p-substrings 
among all of the p-substrings in {-Ri}. For any Ri, if r is 
located inside Ri (neither in the head nor the tail), then 
Ri does not belong to Sk, P - In total, there are k — p Ri's 
satisfying this condition. So in these k + 1 Ri strings, at 
most p + 1 of them can possibly belong to Sk, P - This gives 
usPi(fc,p)<|±i. □ 



Corollary 3.4. In a random string model, the total par- 
tition size is 0(pn). 

PROOF. According to Theorems 13.11 and l3~3l + n < 

^^{p+l)+n<{p+l)n + n^O(pn). 

Since p « k, the total partition size 0(pn) is far smaller 
than Q(kn) in traditional partition methods. In practice, 
p is fixed as a small constant; thus the size becomes <d(n). 
In the following discussion, we present a stronger bound for 
the total partition size without this assumption. 

Theorem 3.5. In a random string model, for any integer 
a>0, Pi(fc + a,p + a) < 2 • Pi(k,p) + 2±2. 
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Proof. Let Sk, P = {s\\s\ — k + 1, s' first or last p- 
substring is the only minimum p-substring in s}, S^ p — 
{s\\s\ — k + 1, s' first or last p-substring is one of the min- 
imum p-substrings in s}. We have Pi(k,p) — |Sfc,p|/4 fe+1 . 
Let P 2 (k,p) = \S^ p \/4 k+1 . Given a k + a + 1 length string 
t that belongs to Sk+a, P +a, consider the k + 1 length string 
consisting of the first k + 1 characters of t. Obviously it be- 
longs to 3%. p , so we have Pi(k + a,p + a) < P2(k,p). Hence, 
we only need to prove P 2 (k,p) — Pi(k,p) < P\(k,p) + ^r. 

Given a string s = S1S2 . . . Sfe+i which belongs to set 
iSfcp — Sk, P , we build an injective mapping from p — Sk, P 
to Sk,p- For the situation where the first p-substring of 
s is one of the minimum p-substrings, let s r be the first 
character that is not 0. Then we map s = S1S2 ■ • ■ Sfc+i to 
s' = si,...,s, — 1, s r — 1, s r +i, . . . , Sk+i- It is easy to see 
that s' belongs to Sk, P , except two situations: p-substring 
S1S2 ... s p is (1) 00 ... or (2) has only one 1 while all other 
characters are 0. These two situations have a probability of 
^tl (detailed proof omitted due to space limit). Similarly, 
for the other situation where the last p-substring of s is one 
of the minimum p-substrings, we map s = S1S2 . . . st+i to 
s — 81, . . . , s, — 1, s r — 1, s r +i, . • . , Sfe+i, where s r is the last 
character that is not 0. Then s" belongs to Sk,p, except for 
the case that p-substring Sk- p +2Sk- P +3 ■ ■ ■ Sfc+i is 00 ... 0, 
whose probability is Hence, IS" 
That is, P 2 (fc,p) < 2 -Pi(k,p) + ^±2. □ 

Assuming k — m/2, k < 100, p < k/5, we have 
kl — k ■ l(m,k,p) — k ■ P\(k,p) ■ (m — k) (Theorem \3.2 
< (2k ■ Px{k - p + 5, 5) + k ■ Jr) ■ (in - k) (Theorem^ 



S\ < |5| + £±2 . 4 fc+i. 



<(2 



k — p + 6 



6 + 0.7) - m/2 



(Tfeeorem l3.3 



100 



< (12 ■ — + 0.7) ■ m/2 < 7.4m. 



Therefore, — n + n < 8 An, which is much better than 
6(fcn). 

3.2 Largest Partition Capacity 

Since MSP has to load/hash each partition into main 
memory, the largest partition capacity, defined as the max- 
imum number of distinct k-mers contained by a partition, 
determines the peak memory. We study its upper bound 
and lower bound in a random string model. 

Theorem 3.6. In a random string model, the maximum 
percentage of distinct k-mers covered by one p-substring is 



bounded by 



4P+1 



when p>2. 



Proof. In a random string model, each symbol has equal 
opportunity to appear in each position of short reads. The 
probability of observing any length-m string is equal. As 
the smallest p-substring defined by lexicographical order, 
the partition built on the p-substring 00 ... has the largest 
number of distinct k-mers. 

Let a(k,p) denote the percentage of distinct k-mers cov- 
ered by p-substring 00 ... 0. This percentage is not related 
to m. For fixed p, there are two situations for a k-mer to 
have a p-substring 00 ... 0: (1) the first k — 1 characters have 
a p-substring 00 ... 0, or (2) the last p-substring is the only 
00 ... in this k-mer. Note that in the later situation the 
first k — p — 1 characters must not have a p-substring 00 ... 



and the (k — p) th character must not be 0. This gives us 

3 1 

a(k,p) = a(k - l,p) + (1 - a(k -p - l,p))j ■ 

Obviously, a(p,p) — and a(k,p) > a(k — l,p). Thus 

3 1 

a(k,p) = a(k - l,p) + (1- a(k - p - l,p))j • 7^ 
< a(k - l,p) + — ^-r <-j- + (k-p)- — ^rr 



41H 



3k 



< — — -,whenp > 2. 
4p+i — 



□ 



We can further establish a lower bound, 
a(k,p) = a(k — l,p) + (1 — a(k —p — l,p)) 
>a (k-l,p) + (l-a(k,p))-^ TT 

>i + ( fc -P)-( 1 -«( fc 'P))'i P TT' 
From above, if 4 < p < k/5, we have 

j < a(k,p) < —. 



3 _L_ 

4 ' 4f 
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Figure 8: The Bounds of a(k,p) 

Figure [8] depicts the bounds for the expected percentage 
of k-mers covered by the largest partition (corresponding to 
a p-substring 00 ... 0) with respect to different k values, p is 
set at 5. The result shows that the bounds we have proved 
are good: When k changes from 50 to 100, the maximum 
percentage of distinct k-mers covered by one minimum p- 
substring (the largest partition) is quite close to the lower 
and upper bounds we provided. 

To calculate the entire distribution of partition capacities 
(the number of distinct k-mers covered by each p-substring) 
in a random string model, we develop an efficient quadratic- 
time algorithm (0(m 2 ), see the Appendix). Using this algo- 
rithm, we do not need to use costly simulation to estimate 
the partition capacity. 

Figure [9] shows the expected distribution of partition ca- 
pacities with respect to different minimum substring lengths, 
assuming that 4 bases A, C, G, T appear with equal prob- 
ability and k-mer length is 59. Here the p-substrings are 
sorted according to the percentage of fe-mers they cover. 
The figure uses logarithm on both axes. The result shows a 
property: when p increases, there is a plateau where many 
p-substrings cover a similar percentage of distinct k-mers. 
We can conclude that there is no extremely memory con- 
suming partition when p is not very small. Furthermore, 
the peak memory of MSP can be fully controlled by p. 
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Figure 9: Expected Partition Capacity Distribution 

4. REVERSE COMPLEMENTS 

DNA sequences can be read in two directions: forwards 
and backwards with each symbol changed to its Watson- 
Crick complements (A o T and C o G). They are called 
reverse complement and considered equivalent in bioinfor- 
matics. Most sequencing techniques extract short reads in 
either direction. In an assembly processing, each sequence 
should be read twice, once in the forward direction and then 
in the reverse complement direction. 

Reverse complement is not an issue for bucket partition- 
ing: when a k-mer is read into memory, a reverse comple- 
ment can be built online. It becomes tricky for minimum 
substring partitioning since MSP intends to compress con- 
secutive k-mers together if they share the same minimum p- 
substring. Unfortunately, their reverse complements might 
not share the same minimum p-substring. This forces us 
to generate the reverse complement explicitly for each short 
read, which will double the I/O cost. 

Definition 5. [Minimum Substring with Reverse Com- 
plements] Given a string s, a length-p substring t of s is 
called the minimum p-substring of s, if Vs , s' is a length- 
p substring of s or s' reverse complement, s.t., t < s (< 
defined by lexicographical order). 

Definition [5] redefines minimum substring by considering 
the reverse complement of each k-mer. With this new def- 
inition, we need not output reverse complements explicitly, 
nor change the minimum substring partitioning process. In 
the following discussion, if not mentioned explicitly, we will 
ignore this problem. 

5. ALGORITHMS 

In this section, we describe the detailed algorithm to build 
a de Bruijn graph. It consists of three steps: Partitioning, 
Mapping and Merging. Each step is performed by a program 
that takes an on-disk representation of input and produces a 
new on-disk representation of output. The input of the first 
step is the raw short read sequences and the output of the 
last step is a sequence of id's mapped to the k-mers in short 
read sequences, in the same order; the duplicate k-mers shall 
have the same id. 

5.1 Partitioning 

The first step is to partition short reads using MSP. A 
straightforward approach is as follows: (1) given a short read 
s, slide a window of width k through s to generate k-mers, 
(2) for each k-mer, calculate its minimum p-substring, (3) 
find super k-mers in s (adjacent k-mers sharing the same 
minimum p-substring). This method has to calculate the 
minimum p-substring of every fc-mer. Each fc-mer needs 



(fe — p + 1) p-substring comparisons. Let m be the length 
of s. In total, this approach needs to perform (fc — p + 1) * 
(m — k + 1)—Q(mk) p-substring comparisons. 

The above solution does not leverage the overlaps among 
adjacent k-mers. When the fc-size window slides through 
s, we can maintain a priority queue on p-substrings in the 
window. Each time, when we slide the window one sym- 
bol to the right, we drop the first p-substring in the pre- 
vious window from the queue and add the last p-substring 
of the current window into the queue. Since the number of 
p-substrings in a window is k — p+1 and there are m — p+1 
p-substrings in s, the number of p-substring comparisons is 
0((m -p+ l)log(fc -p + l))=0(mlogfc). 

While the priority queue is theoretically good, the over- 
head introduced by the queue structure could be high. We 
thus introduce a simple scan algorithm, as described in Al- 
gorithm!]] Algorithm [1] first scans the window from the first 
symbol to find the minimum p-substring, say minja, and the 
start position of min_s, say min_pos. Then it slides the win- 
dow towards right, one symbol each time, till the end of the 
short read. After each sliding, it tests whether min_pos is 
still within the range of the window. If not, it re-scans the 
window to get the new min_s and min_pos. Otherwise, it 
tests whether the last p-substring of the current window is 
smaller than the current min_s. If yes, this last p-substring 
is set as the new min_s and its start position as the new 
min_pos. As analyzed in the previous section, adjacent k- 
mers likely have the same minimum p-substring. Therefore, 
it needs not to re-scan the window very often. Although 
the worst case time complexity is 0(mk) p-substring com- 
parisons, Theorem 15.11 shows it could be more efficient in 
practice since the average number of breaks is small. 



Algorithm 1 SimpleScan 

Input: String s = S1S2 ■ • ■ s m , integer k,p. 
min_s = the minimum p-substring of s[l, k] 
min_pos = the start position of min_s in s 
for all i from 2 to m — k + 1 do 
if i > min_pos then 

min_s = the minimum p-substring of s[i, i + k — 1] 
min_pos = the start position of min_s in s 
else 

if the last p-substring of s[i, i + k—1] < minjs then 
min_s = the last p-substring of s[i, i + k — 1] 
min_pos = the start position of min_s in s 
end if 
end if 
end for 



Theorem 5.1. Given an m-length string, assume mini- 
mum substring partitioning divides s into I + 1 substrings. 
Algorithm]!] needs at most 0(m + Ik) p-substring compar- 
isons. 

Proof. Algorithm[T]shows that min_s and min_pos change 
under two conditions: (1) i > min_pos, or (2) the last p- 
substring of s[i,i + k — 1] < minjs. Under the first condi- 
tion, it re-scans the k-mer s[i, i + k — 1], which introduces 
k—p+1 p-substring comparisons. Under the second condi- 
tion, it compares the last p-substring of s[i, i + k — 1] with 
the current min_s, which involves 1 p-substring comparison. 
Since the string s is broken into I + 1 substrings, min_s and 
min_pos changes for / times. If all these / changes are due to 
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the first condition, the total number of k-mer scans is I + 1, 
including the initial scan of the first k-mer. This results in 
(k — p + 1) * (I + 1) p-substring comparisons. Within each of 
these / + 1 substrings, it needs n t — 1 p-substring compar- 
isons to test the second condition, where n t is the number of 
k-mers within the substring s[it,jt]- For all Z + 1 substrings, 
the total number of p-substring comparisons due to this test 
is Ejjl^(nt — 1) = m — k — I. Therefore the total number 
of p-substring comparisons of Algorithm [1] is bounded by 
(k-p+l)*(l + l) + (m-k-l) = m + lk-pl-p+1, which 
is e(m + Ik). □ 

Definition 6 (Wrapped Partitions). Given a string 
set {si}, a hash function H , the number of partitions t, for 
any k-mer Sij, minimum substring partition wrapping as- 
signs Sij to the H(min p (sij)) mod t partition. 




Figure 10: Partition Wrapping 

Since each p-substring corresponds to one partition, the 
total number of partitions in MSP is equal to 4 P . Whenp in- 
creases, the number will increase exponentially. To counter 
this effect, one can introduce a hash function to wrap the 
number of partitions to any user-specified partition number. 
In this case, each partition generated from a p-substring is 
randomly included in a wrapped partition. The variance of 
partition sizes will likely decrease. Figure [TO] shows the dis- 
tribution of partition size when p = 10 and the number of 
wrapped partitions is set to 256. The number of partitions 
is the same as that of p = 4 without wrapping. In compar- 
ison with Figure [5] the partition size distribution is more 
uniform. 

5.2 Mapping 

In this step, each distinct k-mer is mapped to a unique 
integer id as its vertex id in the de Bruijn graph. A straight- 
forward solution is to process each partition one by one. For 
each partition, insert k-mers into a hash table. Whenever 
there is a k-mer that does not exist in the table, a new id 
is assigned to it. The starting id of k-mers in one partition 
is the maximum k-mer id of the previous partition plus one. 
After one partition is processed, a disk file (called id file) 
is created, the entries in hash table are written to that file. 
This approach works well for the mapping step. However it 
will cause a serious problem in the merging step. 

In the merging step, we scan the short reads again to build 
edges for adjacent k-mers. For each pair of adjacent k-mers, 
we need to locate their ids from their corresponding id files. 
Considering that the id files are as big as the partition files, 
it will cause a lot of I/O and seriously slow down the process. 
In order to solve this problem, we develop an id replacement 
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Figure 11: ID Replacement and Merging 

strategy, as depicted in Figure ITTI During the partitioning 
step, each k-mer is assigned an integer id. The id's are 
assigned increasingly from 1. The same k-mer in different 
short reads receives different id's. Our goal is to replace 
them with the first id it receives. For each partition, we 
create a hash table in memory. Whenever we see a new k- 
mer, we first look up the hash table to see if it exists: if yes, 
we write a replacement record into the id replacement file, 
indicating that the current k-mer is a duplicate and we have 
to replace its pre-assigned id with the id associated with its 
first occurrence. 

Figure [11] shows an example of the id replacement pro- 
cess. Assume k = 5, p = 3, and GTAATGAC occurs in 
two different short reads. In the beginning, each k-mer in 
two GTAATGAC is assigned a unique id, e.g, 7—10 and 
81 - 84, respectively. Sequences GTAATGA is sent to the 
AAT partition, while ATGAC is sent to the ATG partition. 
During this process, the k-mer 81 is mapped to the k-mer 
7, 82 to 8, and 83 to 9, while the k-mer 84 is mapped to the 
k-mer 10. To compress the replacement file, we write the 
replacement records as a range instead of multiple individ- 
ual records. For the example shown in Figure 111! one can 
just output a range record, 81 — > 7 : 3, meaning the 3 con- 
secutive id's starting at 81 will be replaced by 3 consecutive 
id's starting at 7. Range compression is quite effective since 
there are many long overlaps in short reads. According to 
our experiments, this kind of compression reduces the size 
of id replacement files to that of the original short read file. 

5.3 Merging 

After obtaining the id replacement files, the last step is 
merging. In this step, we merge all the replacement files to 
generate a sequence of id's that map to the original short 
reads, in the same order. We first open all the replacement 
files with each file header pointing to the first id replace- 
ment record of the corresponding file. Since all the files are 
already naturally sorted in increasing order by the first entry 
(the pre-assigned id's to be replaced) of replacement record, 
we can find the minimum id to be replaced in the current 
filer headers. We write its replacing id, move to the next re- 
placement record, and iterate. After this process, we get a 
sorted sequence of id's corresponding to k-mers in the short 
read dataset, in the same order. This actually forms a disk- 
based de Bruijn graph. The last step in Figure [11] shows 
this process, where two duplicated GTAATGAC sequences 
receive the same id sequence. This disk-based graph can ei- 
ther be distributed across multiple machines, or compressed 
[26] and loaded into memory. 

6. EXPERIMENTS 
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In this section, we present experimental results to illus- 
trate the memory efficiency, effectiveness and important prop- 
erties of the minimum substring partitioning method on 
four large real-life datasets: cladonema, bumblebee, fish, 
and bird. (1) We first analyze the efficiency of our graph 
construction algorithm in terms of memory and time cost, 
and compare it with two well-known open-source assembly 
programs, Velvet [26] and SOAPdenovo [12] • (2) The per- 
formance of MSP and two traditional partition/merge algo- 
rithms, H-Partition, B-Partition, are compared in terms of 
partition size and runtime. (3) We change different param- 
eter settings to demonstrate important properties of MSP. 
All the experiments, if not specifically mentioned, are con- 
ducted on a server with 2.40GHz Intel Xeon CPU and 512 
GB RAM. 
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6.1 Data Sets 

Four real-life short reads dataset are used to test our algo- 
rithms. The first one is the sequence data of Cladonema pro- 
vided by our collaborators. The bee, fish and bird datasets 
are available via http : //gage . cbcb.umd.edu/data/Bombus_impatiens 

http://bioshare.bioinformatics.ucdavis.edu/Data/hcbxz0i7kg/Ffs J * ure 12: Velvet ' SOAPdenovo, and MSP 



(b) Running Time 



and |http: //bioshare .bioinf ormatics .ucdavis . edu/Data/hc bxz0i7kg /Parrot/BGI_illumina_data[ 



respectively. Table 1 shows some basic facts. 
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the minimum substring length p at 12. For B-Partition, we 
use the last 4 symbols to partition k-mers. All the three al- 
gorith ms use t he similar amount of memory (around 10 GB). 
Figures 13(a) and 13(b) show the maximum disk space usage 



and the total running time. 



Table 1: Datasets: Cladonema, Bombus impa- 
tiens(bee), Lake Malawi cichlid(fish), and Budgeri- 
gar (bird) 

6.2 Efficiency 

We first conduct experiments to compare MSP with two 
real sequence assembly programs on de Bruijn graph con- 
struction: Velvet [26], a classic de Bruijn graph based as- 
sembler, and SOAPdenovo [12], a highly optimized and lead- 
ing assembler. For all the experiments, we set the k-mer 
length to 59 [5]. For MSP, we partition the short reads 
into 1,000 wrapped partitions with the minimum substring 
length p being 12. SOAPdenovo is optimized to support 
multithreading, we use 2 threads here to illustrate its ad- 
vantage. Both Velvet and MSP use 1 thread. The 8-thread 
version of SOAPdenovo can roughly achieve the same run- 
time as MSP. However, its peak memory consumption is still 
the same as its 2-thread version. 

Figure [12] demonstrates that MSP outperforms Velvet and 
SOAPdenovo in terms of memory usage and running time. 
For large datasets, Velvet and SOAPdenovo easily consume 
more than 150G memory, while our method can complete 
the task with less than 10G memory, an order of magnitude 
reduction of memory usagfl 

6.3 Effectiveness 

We then conduct experiments to compare MSP with other 
partition/merge algorithms, H-Partition and B-Partition. 
For all the three methods, we set the k-mer length to 59 and 
partition short reads into 1,000 partitions. For MSP, we set 



2 If the entire de Bruijn graph needs to be loaded in main 
memory, we have routines available that consume 20-30% of 
the memory that SOAPdenovo needs. 
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Figure 13: H-Partition, B-Partition, MSP 

Figure [131 shows MSP outperforms the two baseline meth- 
ods: compared with B-Partition, MSP can reduce the max- 
imum disk space usage by 10-15 times and reduce the total 
execution time by 8-10 times. B-Partition was adopted by 
out-of-core algorithms such as [11) . It implies that MSP is 
better than the classic approach that does not leverage the 
overlaps among data records. H-Partition's overall perfor- 
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mance is the worst since it needs multiple disk scans and 
sorts. 
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Figure 14: SimpleScan vs. Priority Queue 

We then illustrate the advantage of using a scanning method 
(Algorithm [TJ over a priority queue approach in the parti- 
tioning step. Here we set k at 59, p at 12 and partition the 
short reads into 1,000 wrapped partitions. Figure [T4l shows 
that the simple scanning method in Algorithm [1] is around 
2 times faster than the priority queue approach. Similar 
results were observed for other settings of p. 

6.4 Scalability 

We then conduct experiments to test the scalability of 
MSP. We vary the data size by randomly sampling the Clado- 
nema dataset. For MSP, we partition the short reads into 
1,000 wrapped partitions with p set at 10. 
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Figure 16: Scalability: Running Time 

Figures [15] and [16] show that all the three algorithms scale 
linearly in terms of peak memory consumption and running 
time. MSP performs the best. 

6.5 Properties of MSP 

Next we conduct experiments to illustrate the properties 
of minimum substring partitioning. Figure [17] shows the 
change of peak memory, partition size, and running time 
with respect to varying length of minimum substring. Here, 
we set the k-mer length at 59 and partition short reads into 



1,000 wrapped partitions. It shows that the peak mem- 
ory will decrease significantly when the minimum substring 
length is increased. The total partition size and the running 
time will slightly increase. Both increases are negligible, 
indicating that MSP is very effective in reducing memory 
consumption without affecting the runtime performance. 

We then fix p at 10, the number of partitions at 1,000, and 
vary the length of k-mers. Figure [18] shows the change of 
peak memory, partition size, and running time with respect 
to k-mer length. It shows that the peak memory increases 
slowly together with k. It is also observed that increasing 
k will reduce the total partition size and the running time. 
There are two effects inside. Given n short reads with length 
m, the total size of all the k-mers is equal to k(m — k + l)n. 
We have 



k(m - k + 1) 



(m+1) 2 



-(' 



1 



kf 



Hence, the size is peaked when k — (m + l)/2. The second 
effect is the compression ratio of MSP for larger k is higher. 
These two figures demonstrate the second effect dominates, 
since we do not observe a peak at k — (m + l)/2. The result 
is also in line with the analytical conclusion made for the 
random string model (see Theorems l3.2l and l3.6[) . 

7. RELATED WORK 

High throughput sequencing technologies are generating 
tremendous amounts of short reads data. Assembling these 
datasets becomes a critical research topic. With the de- 
velopment of next-generation sequencing techniques, the de 
Bruijn graph sequence assembly approaches became popu- 
lar, including Eulerp], Velvet [26], AllPaths[5], SOAPden- 
ovo [12], etc. 

All these de Bruijn graph based algorithms have to solve 
a critical problem in the process of constructing de Bruijn 
graph, which merges duplicate k-mers into the same ver- 
tex. When the number of short reads comes to the level of 
billions, the de Bruijn graph can easily consume hundreds 
of gigabytes of memory. Several algorithms have been pro- 
posed to solve the memory overwhelming problem of graph- 
based assemblers. Simpson and Durbin [22] adopted FM- 
index [9] to achieve compression in building the string graph 
[16] . which is an alternative graph formulation used in se- 
quence assembly (string graph is much more expensive to 
construct than de Bruijn graph, so it is not as popular as 
de Bruijn graph). However, the step of building the suffix 
array and FM-index is very time-consuming and memory- 
intensive. Orthogonally, Conway and Bromage [6] used suc- 
cinct bitmap data structure to compress the representation 
of de Bruijn graph. But the overall space requirement will 
still increase as the graph becomes "bigger" (more nodes 
and edges). Distributed assembly algorithms were also pro- 
posed, e.g., ABySS[23] and Contrail[21). They partition k- 
mers in a distributed manner to avoid memory bottleneck. 
Unfortunately, using a hash function to distribute k-mers 
evenly across a cluster cannot ensure adjacent k-mers being 
mapped to the same machine. It results in intense cross- 
machine communications since adjacent k-mers form edges 
in the graph. The proposed minimum substring partitioning 
technique solves this problem: it not only generates small 
partitions, but also retains adjacent k-mers in the same par- 
tition. 

The de Bruijn graph construction problem is related to 
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Figure 17: Varying Minimum Substring Length p 
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duplicate detection. The traditional duplicate detection al- 
gorithms perform a merge sort to find duplicates, e.g., Bit- 
ton and DeWitt [4]. Teuhola and Wegner [25] proposed an 
O(l) extra space, linear time algorithm to detect and delete 
duplicates from a dataset. Teuhola [24] introduced an ex- 
ternal duplicate deletion algorithm that makes an extensive 
use of hashing. It was reported that hash-based approaches 
are much faster than sort/merge in most cases. Bucket sort 
[7] is adoptable to these techniques, which works by parti- 
tioning an array into a number of buckets. Each bucket is 
then sorted individually. By replacing sort with hashing, 
it can solve the duplicate detection problem too. Dupli- 
cate detection has also been examined in different contexts, 
e.g., stream Q3] and text [3]. A survey for general dupli- 
cate record detection solutions was given by Elmagarmid, 
Ipeirotis and Verykios [8]. 

The problem setting of de Bruijn graph construction is 
different from duplicate detection in sense that elements in 
short reads are highly overlapped and a de Bruijn graph 
needs to find which element is a duplicate to which. The pro- 
posed minimum substring partitioning technique can utilize 
the overlaps to reduce the partition size dramatically. Mean- 
while, the three steps, partitioning, mapping, and merging 
for disk-based de Bruijn graph construction can efficiently 
connect duplicate k-mers scattered in different short reads 
into the same vertex. 

The concept of minimum substring was introduced in [20] 
for memory- efficient sequence comparison. Our work devel- 
ops minimum substring based partitioning and its use in 
sequence assembly. We also theoretically analyze several 
important properties of minimum substring partitioning. 

8. CONCLUSIONS 



We introduced a new partitioning concept - minimum sub- 
string partitioning (MSP), which is appropriate and efficient 
to solve the duplicate k-mer merging problem in the assem- 
bly of massive short read sequences. It makes use of the 
inherent overlaps among k-mers to generate compact parti- 
tions. This partitioning technique was successfully applied 
to de Bruijn graph construction with very small memory 
footprint. We discussed the relations between the partition 
size and the minimum substring length and analytically de- 
rived the capacity of minimum substrings based on a ran- 
dom string model. Our MSP-based de Bruijn graph con- 
struction algorithm was evaluated on real DNA short read 
sequences. Experimental results showed that it can not only 
successfully complete the tasks on very large datasets within 
a small amount of memory, but also achieve better perfor- 
mance than existing state-of-the-art algorithms. 
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10. APPENDIX 

We design an efficient polynomial-time algorithm for com- 
puting the probability that a given p-substring is the mini- 
mum p-substring of a random n-length string S. The com- 
plication arises because of the huge overlaps among the p- 
substrings of S: each p-substring shares p — 1 symbols with 
its predecessor, so these subproblems are not independent. 
We design a non-trivial dynamic programming algorithm 
that circumvents this complication, and leads to an 0(n 2 ) 
algorithm. Because the underlying problem is quite general, 
we find it best to describe the problem and its solution using 
the following abstract setting. 

Let S = S\S2 ■ ■ ■ s„ be a random string (the DNA se- 
quence), where each letter s; is an independent random 
variable taking values from the set E = {0,1,2,3} with 
probabilities po,pi,p2,P3, respectively. That is, s, assumes 
value j with probability pj, for j = 0,1,2,3, and these 
probabilities sum to 1, namely, Ysj=iPi = We will use 
the notation Si for the prefix substring of S of length i, 
namely, S1S2 ■ ■ ■ s*, and S(j) for its suffix substring of length 
j, namely, s n _j+i . . . s n . The notation S%(j) will be used for 
the j symbol long suffix of the prefix substring Si (j < i), 
namely, Si_j+i . . .Si (see Figure [19)). We will adopt the con- 
vention that substrings of length zero are empty; in partic- 
ular, Si(0) and So(j) are empty strings. Any two substrings 
of equal length can be compared using the lexicographical 
order, and we will use the standard notation <, <, =, >, > 
to denote their relative order. 

In order to distinguish the target string W from the DNA 
sequence, we will call the former a word. In particular, given 
an m-word W (to distinguish the abstract problem from the 
real problem, here we use m instead of p), also on the alpha- 
bet E = {0,1,2,3}, we wish to compute the probability 
that no m-substring of S is smaller than or equal to W . More 
specifically, what is the probability that Si (m) > W, for all 
i = m, m + 1, . . . , n. As we will argue later, if we know this 
probability for W and the m-word immediately preceding 
W in the lexicographical ordering, then by calculating their 
difference we can get the probability that W itself is the 
minimum m-substring, which is what we ultimately need. 

In order to build some intuition into the problem, let us 
consider the prefix Si . Let us call Si clean if it does not con- 
tain an m-substring < W . Suppose we inductively assume 
Si-i to be clean. Then, it follows that Si is clean only if 
Si(m) > W . In other words, to ensure that a prefix sub- 
string Si is clean we need two conditions: (1) the substring 
Si-i is clean, (2) the m-suffix of Si, Si(m), is larger than 
W . In fact, we will need these conditions to be recursively 
enforced, meaning that we will need Si(j) > Wj, for all j. 



St(i) 



S: s 3 s 2 - 



Figure 19: Illustration of S, Si, and the j-sufnx of 

Si. 

With this motivation, we now define the 2-dimensional 
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table Q, which will form the basis of our dynamic program- 
ming algorithm. The table Q has size (n + 1) x m, where 
the entry Q[i, j] holds the probability that Si is clean and 
Si(j) > Wj. Thus, Q[i, 0] is the probability that Si is clean, 
and the final value we wish to compute is Q[n, 0], which is 
the probability that the entire string S is clean, meaning it 
has no m-substring less than or equal to W. Of course, the 
probability that W is the minimum m-word in S is easily 
computed as Q'[n,0] — Q[n,0], where Q' is the same dy- 
namic programming table computed for the target m-word 
W', where W 1 is the immediate predecessor of W in the 
lexicographical ordering of m-words. 

Algorithm MinSTB (Minimum Substring Tail Bounds) 
describes in pseudo-code how to compute the Q table in 
row-major order, with the convention that Q[0, j] — 1 for 
all j. Assuming the first i rows of the table have been com- 
puted, the algorithm shows how to compute the row i + 1. 
The analysis of the algorithm is given in the following the- 
orem. 



Algorithm MinSTB: Computes the values Q[i + 1, j]. 
Require: < j < m, j < i < n 
if i + 1 < m then 

Q[i+1,0] = 1. 
Q[i+l,l] = Ek> ttl P*- 

Q[i+l,j] = J2l >Wj Pk + Q'-J I -v. ■ j>i. 
else 

Q[i+1,0] = Q[i,0]-J2 3 k>Wm p k + Q[i,m-l)-p w , 
if Wm > Wj and j > then 

Q[i+l,j] = Q[i,0]-Y? k>Wm Pk + Q[i,m-l]-p u 
end if 

if w m < Wj and j > then 

Q[i+l,l] = Q[i,0}-J2l >wl Pk. 

Q[i+l,j] = g '•(• •>;' „ P> + Q[i,j-l]-p w 
end if 

if Wm = Wj and j > then 

Q[i+l,l] = Q[i,0]-Efe> mi P*- 
if Wm-i(j-l) > Wj-i then 

Q[i+l,j] = Q[i,0]-T,l >Wj Pk + Q[i,m-l]-p. 
end if 

if Wm-i(j - 1) < Wj-i then 

Q[i+l,j] = Q[*,0] + Q[i,j-l]-p. 

end if 
end if 
end if 



Theorem 10.1. Given a random string S and an m-word 
W on E = {0, 1,2,3}, we can compute the probability that 
S has no m-substring <W in 0(n 2 ) time. 

Proof. We prove how Algorithm MinSTB correctly com- 
putes the (i+l)th row of the table Q from the ith row. First 
consider the case where i + 1 < m. Then Si+i has no m- 
substring, and we just need that Si+i(j) > Wj. If j = 0, 
then Q[i + l,j] = 1. If j = 1, then we simply need that 
Si+i > Wi. Thus 

3 

Q[t+l,l] = ^]p fe . 
Finally if j > 1, then all we need is that either (1) Si+i > Wj, 



or (2) Si+i = Wj and Si(j-l) > Wj-i. Therefore: 

3 

where Ysk>w P k ^ s ^ ne probability that Si+i > Wj, p Wj is the 
probability that Si+i = wj, and Q[i,j — 1] is the probability 
that Si is clean and Si(j — 1) > Wj_i. 

Now consider the case where i + 1 > m. Suppose Si is 
clean, then Sj+i is not clean if and only if Si+i(m) < W. 
This will not happen if and only if s i+ i > w m , or s i+ i — w m 
but Si(m-l) > VFm-l. Therefore 

3 

Q[i + 1,0] = Q[i,0]- ^ p k + Q[i,m-l]-p Wm , 

where Q[i,0] is the probability that Si is clean. 

The computation of Q[i + l,j] for j / depends on the 
value of Wm compared to Wj . This stems from the fact that 
if Wm < Wj, then we only need to compare Si+i against Wj, 
i.e., if Si(j-l) > Wj-\ then necessarily Si(m— 1) > W m -i- 
In fact, there are three cases: 

1. Wm > Wj. In this case if Si+i > w m > Wj then 5* cannot 
have any m-substring < W m , or any j-substring < Wj, 
which ends at index i. So all we need is for Si to be clean. 
If Sj+i < w m then Si+i is not clean. If Sj+i = u) m , then 
Si+i > and Si+i(j) > Wj. In this case we just need 
that Si(m— 1) > Wm-i, and we have 

Q[m,j] = Q[i,0]. ^ p fe + Q[i,m-1] .p mm . 

This holds if j = 1, since Sj+i > Wj even if s,+i = lOm. 

2. w m < uij. The argument is similar to the previous case: 
if Sj+i > Wj > w m , then all we need is for Si to be clean. 
If Si+i = Wj then if j = 1, 

Q[i+l,l] = Q[i,0] ■ 

but if j > 1 we need that Si(J — 1) > Wj-i. Therefore 

Q[i+l,i] = Q[i,0]-^p fc + Q[i,i-l]-p^. 

fc>ujj 

3. w m = Wj. If Si+i > w m then all we need is for Si to be 
clean. If Sj+i = ™ m = tuj, then if j = 1 

Q[i+l,l] = Q[i,0] ■ P^ 
If j > 1 there are two cases: 

• W m -i(j— 1) > Wj_i. Then if S^+i is clean, necessarily 
&+i0') > Wj because Sj+i = Wj and S , i+ i(m — 1) > 
Wm-i, which implies that Si+i(j — 1) > Wj-i. There- 
fore we only need to be clean, which happens only 
if Si(m — 1) > W m _i. In this case: 

Q[i+l,j] = Q[i,0]-^p fc + Qli.m-ll-p^.. 

fc>T*;j 

• W m -i(j — 1) < Wj_i. The argument is the same as 
the previous case, except that now we need Si(J — 1) > 
Wj-i, which happens with probability Q[i,j— 1]. Thus 

Q[i+M = QM]-^Pfc + Q[i,j-1] -P»j. 
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Each entry of the table can be computed in constant time, 
and therefore the whole table can be computed in 0(n 2 ) . □ 
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